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Abstract 

The modem scale of data has brought new challenges to Bayesian inference. In 
particular, conventional MCMC algorithms are computationally very expensive 
for large data sets. A promising approach to solve this problem is embarrassingly 
parallel MCMC (EP-MCMC), which first partitions the data into multiple subsets 
and mns independent sampling algorithms on each subset. The subset posterior 
draws are then aggregated via some combining rules to obtain the final approxima¬ 
tion. Existing EP-MCMC algorithms are limited by approximation accuracy and 
difficulty in resampling. In this article, we propose a new EP-MCMC algorithm 
PART that solves these problems. The new algorithm applies random partition 
trees to combine the subset posterior draws, which is distribution-free, easy to re¬ 
sample from and can adapt to multiple scales. We provide theoretical justification 
and extensive experiments illustrating empirical performance. 


1 Introduction 

Bayesian methods are popular for their success in analyzing complex data sets. However, for large 
data sets, Markov Chain Monte Carlo (MCMC) algorithms, widely used in Bayesian inference, can 
suffer from huge computational expense. With large data, there is increasing time per iteration, in¬ 
creasing time to convergence, and difficulties with processing the full data on a single machine due 
to memory limits. To ameliorate these concerns, various methods such as stochastic gradient Monte 
Carlo [1] and sub-sampling based Monte Carlo [2] have been proposed. Among directions that have 
been explored, embarrassingly parallel MCMC (EP-MCMC) seems most promising. EP-MCMC 
algorithms typically divide the data into multiple subsets and run independent MCMC chains si¬ 
multaneously on each subset. The posterior draws are then aggregated according to some rules to 
produce the final approximation. This approach is clearly more efficient as now each chain involves 
a much smaller data set and the sampling is communication-free. The key to a successful EP-MCMC 
algorithm lies in the speed and accuracy of the combining rule. 

Existing EP-MCMC algorithms can be roughly divided into three categories. The first relies on 
asymptotic normality of posterior distributions. [3] propose a “Consensus Monte Carlo” algorithm, 
which produces final approximation by a weighted averaging over all subset draws. This approach is 
effective when the posterior distributions are close to Gaussian, but could suffer from huge bias when 
skewness and multi-modes are present. The second category relies on calculating an appropriate 
variant of a mean or median of the subset posterior measures [4, 5]. These approaches rely on 
asymptotics (size of data increasing to infinity) to justify accuracy, and lack guarantees in finite 
samples. The third category relies on the product density equation (PDE) in (1). Assuming X is the 
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observed data and 0 is the parameter of interest, when the observations are iid conditioned on 0, for 
any partition of X = U U • • • U X^'^\ the following identity holds, 

p{e\X) oc 7r(6>)p(X|6>) oc p(6>|X(i))p(6»|X(2)).. (1) 

if the prior on the full data and subsets satisfy 7r(6>) = Hili [^1 proposes using kernel density 

estimation on each subset posterior and then combining via (1). They use an independent Metropolis 
sampler to resample from the combined density. [7] apply the Weierstrass transform directly to (1) 
and developed two sampling algorithms based on the transformed density. These methods guarantee 
the approximation density converges to the true posterior density as the number of posterior draws 
increase. However, as both are kernel-based, the two methods are limited by two major drawbacks. 
The first is the inefficiency of resampling. Kernel density estimators are essentially mixture distri¬ 
butions. Assuming we have collected 10,000 posterior samples on each machine, then multiplying 
just two densities already yields a mixture distribution containing 10^ components, each of which 
is associated with a different weight. The resampling requires the independent Metropolis sampler 
to search over an exponential number of mixture components and it is likely to get stuck at one 
“good” component, resulting in high rejection rates and slow mixing. The second is the sensitivity 
to bandwidth choice, with one bandwidth applied to the whole space. 

In this article, we propose a novel EP-MCMC algorithm termed “parallel aggregation random trees” 
(PART), which solves the above two problems. The algorithm inhibits the explosion of mixture 
components so that the aggregated density is easy to resample. In addition, the density estimator is 
able to adapt to multiple scales and thus achieve better approximation accuracy. In Section 2, we 
motivate the new methodology and present the algorithm. In Section 3, we present error bounds and 
prove consistency of PART in the number of posterior draws. Experimental results are presented in 
Section 4. Proofs and part of the numerical results are provided in the supplementary materials. 

2 Method 


Recall the PDE identity (1) in the introduction. When data set X is partitioned into m subsets 
X = X^^^ U • • • U X^'^\ the posterior distribution of the i* subset can be written as 

/W(6>) oc 7r(6>)i/XxW|6»), (2) 

where tt{0) is the prior assigned to the full data set. Assuming observations are iid given 0, the 
relationship between the full data posterior and subset posteriors is captured by 

m m 

p{0\X) oc 7r(6») l[p{X^^^\0)^l[f^^HO). (3) 

i = l i=l 

Due to the fiaws of applying kernel-based density estimation to (3) mentioned above, we propose 
to use random partition trees or multi-scale histograms. Let Xjc be the collection of all Re¬ 
partitions formed by K disjoint rectangular blocks, where a rectangular block takes the form of 

def 

Ak = rk,i] X {lk,2,rk,2] X • • • {lk,p, rk,p] C Rp for some lk,q < rk,q. A AT-block histogram 
is then defined as 

where {Aj^ : k = - ,X} G Xjc are the blocks and X, are the total number of posterior 

samples on the subset and of those inside the block Aj^ respectively (assuming the same N across 
subsets). We use | • | to denote the area of a block. Assuming each subset posterior is approximated 
by a X-block histogram, if the partition {A^} is restricted to be the same across all subsets, then the 
aggregated density after applying (3) is still a X-block histogram (illustrated in the supplement). 


^ m ^ / m 

pm) = -i[p\e) = -Y,(ll 


1{9 e Ak) = y^wfcgfc(6>), 


where Z = Hlli /\Ak\^ ^ is the normalizing constant, rc/^’s are the updated weights, 

and gk{0) = unif(6>; Ak) is the block-wise distribution. Common histogram blocks across subsets 
control the number of mixture components, leading to simple aggregation and resampling proce¬ 
dures. Our PART algorithm consists of space partitioning followed by density aggregation, with 
aggregation simply multiplying densities across subsets for each block and then normalizing. 
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2.1 Space Partitioning 


To find good partitions, our algorithm recursively bisects (not necessarily evenly) a previous block 
along a randomly selected dimension, subject to certain rules. Such partitioning is multi-scale and 
related to wavelets [8]. Assume we are currently splitting the block A along the dimension q and 
denote the posterior samples in A by {Oj for the subset. The cut point on dimension q is 

determined by a partition rule }, * * * , resulting two blocks are subject 

to further bisecting under the same procedure until one of the following stopping criteria is met — 
(i) rik/N < 5pOY (ii) the area of the block \Aj^ \ becomes smaller than S\A\- The algorithm returns a 
tree with K leafs, each corresponding to a block Ak. Details are provided in Algorithm 1. 


Algorithm 1 Partition tree algorithm 


1 

procedure BuildTree({«^4> ' ’' > 0(-)> 5a, N, L, R) 

2 

D ^ {1, 2, • • • ,p} 


3 

while D not empty do 


4 

Draw q uniformly at random from D. 

> Randomly choose the dimension to cut 

5 

r.A 

7 G- Cardinality of for all i 

6 

i{0* -Lg> 6a, Rg - et > 6a and min(E,- < 0*), Ei 2 > 0*)) > N6p 


for all i then 


7 

L' ^ L, L'g ^ e*, R' ^RR'g^ 0* 

> Update left and right boundaries 

8 

T.c A- BuiLDTREE({6>j^^ : < 6>*}, • 


9 

T.u A- BuiLDTREE({6>j^^ : > 6>*}, • 

•• ,{T ■SA’ ,N,L',R) 

10 

return T 

11 

else 


12 

D^D\{q} 

\> Try cutting at another dimension 

13 

end if 


14 

end while 


15 

T.C A- NULL, r.n A- NULL, return T 

> Leaf node 

16 

end procedure 



In Algorithm becomes the minimum edge length of a block Sa (possibly different across di¬ 

mensions). Quantities L, G are the left and right boundaries of the samples respectively, which 
take the sample minimum/maximum when the support is unbounded. We consider two choices for 
the partition rule 0(-) — maximum (empirical) likelihood partition (ML) and median/KD-tree par¬ 
tition (KD). 


Maximum Likelihood Partition (ML) ML-partition searches for partitions by greedily maximiz¬ 
ing the empirical log likelihood at each iteration. For m = 1 we have 


0* = = !,■■■ ,n}) 


arg max 

ni-\-n2=n,AiUA2=A 


ni 

n\Ai\ 


n2 

n\A2\ 


n2 


( 6 ) 


where ni and n 2 are counts of posterior samples in Ai and A 2 , respectively. The solution to (6) falls 
inside the set {Oj}. Thus, a simple linear search after sorting samples suffices (by book-keeping the 
ordering, sorting the whole block once is enough for the entire procedure). For m > 1, we have 


Z>g,ML 


(•) = arg max 




Ai) 




Ai) 


Ai) 


nA)\A2 


^2 


(7) 


similarly solved by a linear search. This is dominated by sorting and takes 0{n log n) time. 


Median/KD-Tree Partition (KD) Median/KD-tree partition cuts at the empirical median of pos¬ 
terior samples. When there are multiple subsets, the median is taken over pooled samples to force 
{Ak} to be the same across subsets. Searching for median takes 0{n) time [9], which is faster than 
ML-partition especially when the number of posterior draws is large. The same partitioning strategy 
is adopted by KD-trees [10]. 
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2.2 Density Aggregation 


Given a common partition, Algorithm 2 aggregates all subsets in one stage. However, assuming a 
single “good” partition for all subsets is overly restrictive when m is large. Hence, we also consider 
pairwise aggregation [6, 7], which recursively groups subsets into pairs, combines each pair with 
Algorithm 2, and repeats until one final set is obtained. Run time of PART is dominated by space 
partitioning (BuildTree), with normalization and resampling very fast. 


Algorithm 2 Density aggregation algorithm (drawing N' samples from the aggregated posterior) 


1 

procedure ONESTAGEAGGREGATE({6>j^^}, {of^}, ■ 

■■ ,{e‘p^},4>{-),5p,Sa,N, N',L, R) 

2 

T ^ BuildTree({0W}, {0f}, ■ ■ ■ , </.(■), Sp, Sa, N, L, R), Z^O 

3 

{{Ak}, {n^k}) ^ TraverseLeaf(T) 


4 

for A: = 1, 2, • • • , AT do 


5 

^ni^iV ^,z^z-^wk 

> Multiply inside each block 

6 

end for 


7 

Wk ^ W]^/Z for all k 

> Normalize 

8 

for t = 1, 2, • • • ,N' do 


9 

Draw k with weights {wk} and then draw Of ^ 

9k{0) 

10 

end for 


11 

return {6>i,6>2, • * * ,0n'} 


12 

end procedure 



2.3 Variance Reduction and Smoothing 

Random Tree Ensemble Inspired by random forests [ 11 , 12 ], the full posterior is estimated by 
averaging T independent trees output by Algorithm 1. Smoothing and averaging can reduce vari¬ 
ance and yield better approximation accuracy. The trees can be built in parallel and resampling in 
Algorithm 2 only additionally requires picking a tree uniformly at random. 

Local Gaussian Smoothing As another approach to increase smoothness, the blockwise uniform 
distribution in (5) can be replaced by a Gaussian distribution = N{0] /i/^, S/c), with mean and 
covariance estimated “locally” by samples within the block. A multiplied Gaussian approximation 

is used: S/e = {YhLi ^ Tk = ^k{YJiLi where and are estimated 

with the subset. We apply both random tree ensembles and local Gaussian smoothing in all 
applications of PART in this article unless explicitly stated otherwise. 


3 Theory 

In this section, we provide consistency theory (in the number of posterior samples) for histograms 
and the aggregated density. We do not consider the variance reduction and smoothing modifications 
in these developments for simplicity in exposition, but extensions are possible. Section 3.1 pro¬ 
vides error bounds on ML and KD-tree partitioning-based histogram density estimators constructed 
from N independent samples from a single joint posterior; modified bounds can be obtained for 
MCMC samples incorporating the mixing rate, but will not be considered here. Section 3.2 then 
provides corresponding error bounds for our PART-aggregrated density estimators in the one-stage 
and pairwise cases. Detailed proofs are provided in the supplementary materials. 

Let f{0) be a p-dimensional posterior density function. Assume / is supported on a measurable 
set ft CW. Since one can always transform to a bounded region by scaling, we simply assume 
ft = [0,1]^ as in [8, 13] without loss of generality. We also assume that / G C^(f^). 


4 





3.1 Space partitioning 

Maximum likelihood partition (ML) For a given K, ML partition solves the following problem: 


/ml = arg max ^ log Uk/N > cop, \Ak\> p/ D, (8) 

for some cq and p, where D = \\f\\oo < oo. We have the following result. 

Theorem 1. Choose p = \ j, For any ^ > 0, if the sample size satisfies that N > 
2(1 — log(2Ff/^), then with probability at least 1 — 5, the optimal solution to (8) 

satisfies that 


DKLifWfML) < {Ci + 2\ogK)K 2p +C2max{logL>,21ogK} 




where Ci = \ogD + ^pLD with L = ||/'||(X) C 2 = 48v^p + 1. 


When multiple densities - j(m) presented, our goal of imposing the same partition 

on all functions requires solving a different problem, 


(/ ml)^i = argmax 


m 

E 


K 


— E 


ny log 


n 


(i) 


s.t. ny/Ni > cop, \Ak\ > p/D, (9) 


where Ni is the number of posterior samples for function A similar result as Theorem 1 for (9) 
is provided in the supplementary materials. 


Median partition/KD-tree (KD) The KD-tree /kd cuts at the empirical median for different 
dimensions. We have the following result. 

Theorem 2. For any s > 0, define = log 2 + 2 + 3 L/£ ^ ^ ^ 0, if N > 

32(log K^K log{2K/5), then with probability at least 1 — have 

WIkd - IkdWi < s+pLK~^‘/P + 4elogK^^\og 

If f{0) is further lower bounded by some constant bo > 0, we can then obtain an upper bound on 
the KL-divergence. Define = log 2 f 1 + 2 + 3 L/bo l have 


DklUWIkd) < 


^^if-’-».o/p + 8elogii: 
bo 



When there are multiple functions and the median partition is performed on pooled data, the partition 
might not happen at the empirical median on each subset. However, as long as the partition quantiles 
are upper and lower bounded by a and 1 — a for some a e [1/2,1), we can establish results similar 
to Theorem 2. The result is provided in the supplementary materials. 


3.2 Posterior aggregation 

The previous section provides estimation error bounds on individual posterior densities, through 
which we can bound the distance between the true posterior conditional on the full data set and the 
aggregated density via (3). Assume we have m density functions , / = 1, 2, • • • , m} and intend 
to approximate their aggregated density // = fliG/ / / ^iei where / = {1, 2, • • • , m}. 
Notice that for any /' C /, fj, = p[0\ Uie/' Let D = max//c/ ||//' Hoo^ i-C., D is an upper 

bound on all posterior densities formed by a subset of X. Also define Zjf = f fliG/' f^^\ These 
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quantities depend only on the model and the observed data (not posterior samples). We denote /ml 
and /kl> by / as the following results apply similarly to both methods. 

The “one-stage” aggregation (Algorithm 2) first obtains an approximation for each (via either 
ML-partition or KD-partition) and then computes // = Yliei 

Theorem 3 (One-stage aggregation). Denote the average total variation distance between and 
}yy Assume the conditions in Theorem 1 and 2 and for ML-partition 

Vn > 32cp ^y2{p + ^log log 

and for KD-partition 

N > l23e^K{\ogKf\og{K/5). 

Then with high probability the total variation distance between // and fj is bounded by 

\\h-fi\\i<fm{2Dr-h, 

where Zj is a constant that does not depend on the posterior samples. 

The approximation error of Algorithm 2 increases dramatically with the number of subsets. To 
ameliorate this, we introduce the pairwise aggregation strategy in Section 2, for which we have the 
following result. 

Theorem 4 (Pairwise aggregation). Denote the average total variation distance between f^'^^ and 
by e. Assume the conditions in Theorem 3. Then with high probability the total variation 
distance between fj and fj is bounded by 

II//-//111 < 

where Cq = max///(^ 7 /c 7 is a constant that does not depend on posterior samples. 


4 Experiments 

In this section, we evaluate the empirical performance of PART and compare the two algorithms 
PART-KD and PART-ML to the following posterior aggregation algorithms. 

1. Simple averaging (average): each aggregated sample is an arithmetic average of M sam¬ 
ples coming from M subsets. 

2. Weighted averaging (weighted): also called Consensus Monte Carlo algorithm [3], 
where each aggregated sample is a weighted average of M samples. The weights are 
optimally chosen for a Gaussian posterior. 

3. Weierstrass rejection sampler (Weierstrass): subset posterior samples are passed through 
a rejection sampler based on the Weierstrass transform to produce the aggregated sam¬ 
ples [7]. We use its R package^ for experiments. 

4. Parametric density product (parametric): aggregated samples are drawn from a multi¬ 
variate Gaussian, which is a product of Laplacian approximations to subset posteriors [6]. 

5. Nonparametric density product (nonparametric): aggregated posterior is approximated 
by a product of kernel density estimates of subset posteriors [6]. Samples are drawn with 
an independent Metropolis sampler. 

6. Semiparametric density product (semiparametric): similar to the nonparametric, but 
with subset posteriors estimated semiparametrically [6, 14]. 


^https://github.com/wwrechard/weierstrass 
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All experiments except the two toy examples use adaptive MCMC [15, 16] ^ for posterior sampling. 
For PART-KD/ML, one-stage aggregation (Algorithm 2) is used only for the toy examples (results 
from pairwise aggregation are provided in the supplement). For other experiments, pairwise aggre¬ 
gation is used, which draws 50,000 samples for intermediate stages and halves 6p after each stage 
to refine the resolution (The value of 6p listed below is for the final stage). The random ensemble of 
PART consists of 40 trees. 

4.1 Two Toy Examples 

The two toy examples highlight the performance of our methods in terms of (i) recovering multiple 
modes and (ii) correctly locating posterior mass when subset posteriors are heterogeneous. The 
PART-KD/PART-ML results are obtained from Algorithm 2 without local Gaussian smoothing. 

Bimodal Example Figure 7 shows an example consisting of m = 10 subsets. Each subset 
consists of 10,000 samples drawn from a mixture of two univariate normals 0.27A^(/ii,i, cr|^) + 
0.73A^(/ii,2, cr 2 ^ 2 )’ with the means and standard deviations slightly different across subsets, given by 
= 5 + e ^^2 and = 1 + cr ^,2 = 4 + where ^ ^(0? 0.5), 

6i^i A^(0,0.1) independently for m = 1, • • • ,10 and I = 1,2. The resulting true combined pos¬ 
terior (red solid) consists of two modes with different scales. In Figure 7, the left panel shows the 
subset posteriors (dashed) and the true posterior; the right panel compares the results with various 
methods to the truth. A few are omitted in the graph: average and weighted average overlap with 
parametric, and Weierstrass overlaps with PART-KD/PART-ML. 



Figure 1: Bimodal posterior combined from 10 subsets. Left: the true posterior and subset posteriors 
(dashed). Right: aggregated posterior output by various methods compared to the truth. Results are 
based on 10,000 aggregated samples. 

Rare Bernoulli Example We consider = 10,000 Bernoulli trials Xi Ber(6>) split into m = 
15 subsets. The parameter 0 is chosen to be 2m jN so that on average each subset only contains 
2 successes. By random partitioning, the subset posteriors are rather heterogeneous as plotted in 
dashed lines in the left panel of Figure 8. The prior is set as 7r{0) = Beta(^; 2, 2). The right 
panel of Figure 8 compares the results of various methods. PART-KD, PART-ML and Weierstrass 
capture the true posterior shape, while parametric, average and weighted average are all biased. The 
nonparametric and semiparametric methods produce fiat densities near zero (not visible in Figure 8 
due to the scale). 

4.2 Bayesian Logistic Regression 

Synthetic dataset The dataset {(x^,^i)}^^ consists of = 50,000 observations in p = 50 
dimensions. All features G are drawn from A/p_i(0, 5]) withp = 50 and = O.QI^^^L 
The model intercept is set to —3 and the other coefficient 0*’s are drawn randomly from A^(0, 5^). 
Conditional on x^, i/i G {0,1} follows p{yi = 1) = 1/(1 + exp(— 0*^[1, x^])). The dataset is 
randomly split into m = 40 subsets. For both full chain and subset chains, we run adaptive MCMC 
for 200,000 iterations after 100,000 burn-in. Thinning by 4 results in T = 50, 000 samples. 

The samples from the full chain (denoted as {6j}J=i) are treated as the ground truth. To compare the 
accuracy of different methods, we resample T points {9j} from each aggregated posterior and then 

^http://hellos.fmi.f1/~lainema/mcmc/ 
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Figure 2: The posterior for the probability 6> of a rare event. Left: the full posterior (solid) and 
m = 15 subset posteriors (dashed). Right: aggregated posterior output by various methods. All 
results are based on 20,000 aggregated samples. 
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compare them using the following metrics: (1) RMSE of posterior mean \\^Q2j Oj — ^j )\\2 

(2) approximate KL divergence D}^{p{6)\\p{6)) and D}^{p{6)\\p{6)), where p and p are both 
approximated by multivariate Gaussians (3) the posterior concentration ratio, defined as r = 

/Sj ll^j ~ ^*111’ which measures how posterior spreads out around the true 

value (with r = 1 being ideal). The result is provided in Table 1. Figure 4 shows the Dkl{p\\p) 
versus the length of subset chains supplied to the aggregation algorithm. The results of PART are 
obtained with 6p = 0.001, Sa = 0.0001 and 40 trees. Figure 3 showcases the aggregated posterior 
for two parameters in terms of joint and marginal distributions. 


Method 

RMSE 

-Dkl(pIIp) 

£’KL(p|b) 

r 

6 

4 


I. I 

PART (KD) 

0.587 

3.95 X 10^ 

6.45 X 10^ 

3.94 

2 

0 


ll J 

PART (ML) 

1.399 

8.05 X 10^ 

5.47 X 10^ 

9.17 

25 


- Semiparametric 

average 

29.93 

2.53 X 10^ 

5.41 X 10^ 

184.62 

20 


— Weierstrass 

- Average 

- Weighted 

weighted 

38.28 

2.60 X 10^ 

2.53 X 10® 

236.15 

f) 

s 


Weierstrass 

6.47 

7.20 X 10^ 

2.62 X 10® 

39.96 

^17 

10 



parametric 

10.07 

2.46 X 10^ 

6.12 X 10® 

62.13 

5 


« 

t 

nonparametric 

25.59 

3.40 X 10^ 

3.95 X 10^ 

157.86 

0 


% 

semiparametric 

25.45 

2.06 X 10^ 

3.90 X 10^ 

156.97 


iO -20 

-10 


■ -True 
— PART-KD 
— PART-ML 
—Parametric 
— Nonparametric 



Table 1: Accuracy of posterior aggregation on logistic regression. Figure 3: Posterior of 0i and 0i 


Real datasets We also run experiments on two real datasets: (1) the Covertype dataset^ [17] con¬ 
sists of 581,012 observations in 54 dimensions, and the task is to predict the type of forest cover 
with cartographic measurements; (2) the MiniBooNE dataset"^ [18, 19] consists of 130,065 observa¬ 
tions in 50 dimensions, whose task is to distinguish electron neutrinos from muon neutrinos with 
experimental data. For both datasets, we reserve 1/5 of the data as the test set. The training set 
is randomly split into m = 50 and m = 25 subsets respectively for covertype and MiniBooNE. 
Figure 9 shows the prediction accuracy versus total runtime (parallel subset MCMC aggregation 
time) for different methods. For each MCMC chain, the first 20% iterations are discarded before ag¬ 
gregation as burn-in. The aggregated chain is required to be of the same length as the subset chains. 
As a reference, we also plot the result for the full chain and lasso [20] run on the full training set. 

5 Conclusion 


In this article, we propose a new embarrassingly-parallel MCMC algorithm PART that can efficiently 
draw posterior samples for large data sets. PART is simple to implement, efficient in subset com¬ 
bining and has theoretical guarantees. Compared to existing EP-MCMC algorithms, PART has sub¬ 
stantially improved performance. Possible future directions include (1) exploring other multi-scale 
density estimators which share similar properties as partition trees but with a better approximation 

^http://www.csie.ntu.edu.tw/~ cjlin/libsvmtools/datasets/binary.html 
^https://archive.ics.uci.edu/mi/machine-iearning-databases/OOi99 
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Figure 4: Approximate KL diver¬ 
gence between the full chain and 
the combined posterior versus the 
length of subset chains. 


Figure 5: Prediction accuracy versus total runtime (running 
chain -i- aggregation) on Covertype and MiniBooNE datasets 
(semiparametric is not compared due to its long running time). 
Plots against the length of chain are provided in the supplement. 


accuracy (2) developing a tuning procedure for choosing good Sp and Sa, which are essential to the 

performance of PART. 
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Supplementary Materials 


Appendix A: Schematic illustration of the algorithm 

Here we include a schematic illustration of the density aggregation in PART algorithm. 



Figure 6: A schematic figure illustrating the density aggregation step of the algorithm. Two trees 
in the left share the same block structure and the aggregated histogram is obtained by block-wise 
multiplication and renormalization. 


Appendix B: Proof of Theorem 1 


Let Tk,p be a subset of T^ defined as Tk,p = {/o ^ > p}. We prove a more 

general form of Theorem 1 here. 

Theorem 5. For any > 0, if the sample size satisfies that N > max{Ff, ihen with 

probability at least 1 — 5, the optimal solution to (8) satisfies that 

DKLifWfML) < DKLifWfo) + ^ log log , 

where Cp = + 1 max { log D, log }. 

Now choosing p = Xjthe condition becomes N > 2(1 — \og{2K/5), 

then with probability at least 1 — 5 we have 


DKL{f\\fML) < (Ci + 21ogi^)i^-^ +C2max{logA21ogi^} 




where Ci =2 log D + ApLD with L = ||/'||c» <^J^d C 2 = 4:S^/p~f^. 


When multiple densities - j(m) ^ presented, our goal of imposing the same partition 

on all functions requires solving a different problem (9). As long as E /ml — ® fpp 

remains true, where = argmin^^^p^ ^ II/o)^ the whole proof of Theorem 1 is also 

valid for (9). Therefore, we have the following Corollary. 

Corollary 1 (m copies). For any 5 > 0, if the sample size satisfies that N > 2(1 — 
log(2mFf/^) and p = 1/^ then with probability at least 1 — 5, the optimal 
solution to (9) satisfies that 

^ {Ci + 2\ogK)K~^ + C 2 max { log D,2\ogK}^^ log log • 
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To prove Theorem 1 we need the following lemmas. 

Lemma 1. The optimal solution of (8) is also the optimal solution of the following problem, 


^ K K 

/ml = argmax —log TT/c, s.t. rik > copN, \Ak\ > p/D and y^7rk\Ak\ = 1. (10) 

feTr. 


k=l 


k=l 


Proof We write out the empirical log likelihood 

-^^nfclog7rfe = ^^log7rfc|Afc|- ^ '^\og\Ak\. 


k=l 


k=l 




For any fixed partition {Ak}, under the constraint that Ylk=i ^k\^k\ = 1, one can easily see that 


^k\Ak\ = 


N 


maximizes the result. 


□ 


Next, we show that the optimal approximation 

fK,p= argmin Dkl(/||/ml) 

/mI/GF k,p 

is a feasible solution to (10) with a high probability. 

Lemma 2. Let fK,p be the optimal approximation in Tk,p, then fx^p satisfies that min/^ \Ak\ > 
p/D. In addition, with probability at least 1 — K exp(—(1 — Ciff‘pN/2), we have Uk/N > cop, 
be., fK,p is a feasible soluton of (10). 


Proof Let rik be the counts of data points on the partition of fK,p- Notice fK,p is a fixed func¬ 
tion that does not depend on the data. Therefore, each Uk follows a binomial distribution. Define 
P{Ak) = According to the definition of Tk,p, we have P{Ak) > p. Using the Chemoff’s 

inequality, we have for any 0 < ^ < 1, 

Taking ^ = 1 — cq and union bounds we have 

P^rnin ^ > copj >l-K exp(-(l - cofpN/2). 

On the other hand, the following inequality shows the bound on \Ak\, 

\Ak\= [ 1> / f/D>p/D. 

J Ak I Ak 


□ 

Lemma 2 states that with a high probability we have IE log /ml ^ E log fK,p- This result will be 
used to prove our main theorem. 

Although the actual partition algorithm selects the dimension for partitioning completely at ran¬ 
dom for each iteration, in the proof we will assume one predetermined order of partition (such as 
{1, 2, 3, • • • ,p, 1, 2, • • • }) just for simplicity. The order of partitioning does not matter as long as 
every dimension receives sufficient number of partitions. When the selection is randomly taken, 
with high probability (increasing exponentially with N), the number of partitions in each dimen¬ 
sion will concentrate around the average. Thus, it suffices to prove the result for the simple 
{1,2,3, ••• ,p, 1,2, •••{case. 
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Proof of Theorem 1. The proof consists of two parts, namely (1) bounding the excess loss com¬ 
pared to the optimal approximation fx^p in Tx.p and (2) bounding the error between the optimal 
approximation and the true density. 

For the first part, using the fact that E log fx^p{0) < E log /m l{0), the excess loss can be expressed 
as 

DxhifWfML) - DxL{f\\fK,p) = Elog/K,p(6>) - E log 

= E log fx,p{0) - E log fx,p{0) + E log fx,p{0) 

- E log /ml {0) + E log /ml (6>) - E log /ml {0) 

< E log fx,pm-^ log fK,p (^) + E log /ml (^) - E log /ml {0 ). 

Assuming the partitions for fx,p and /ml are {A^} and {A^} respectively, we have 

K K 

DKL{f\\fML)-DKL{f\\fK,p) = log TTfe logJ 

k=l k=l ^\Ak\ 

< ( max I log TTfc I + max | log f’l | ) sup V jEl^^, - El^j, | • (11) 

V * N\Ak\'J 

Following a similar argument as Lemma 1, for fx,p we can prove that tt/^ = /^^ f{0)d0/\Aj^\, thus 
we have po < TTk < D for any 1 < k < K. Similarly for x^Ak\ ^ 

Therefore, the first term in (11) can be bounded as 

max I log TTk I + niax I log —^— I < 3 maxflog log p~^}. 

^ ' N\Ak\' 

The second term in (11) is the concentration of the empirical measure over all possible K-rectangular 
partitions. Using the result from [1], we have the following large deviation inequality. For any 
e G (0,1), we have 

P E I > < 4 exp I - ^ I, ( 12 ) 

if > max{Ar, (1001og6)/e^, 2^(p + l)Ariog(3eA^/Ar)/e^}. For any ^ > 0, taking e = 2^(p + 
l)Ariog(3eA^/Ff)/A^log(4/^), we have that 


sup ^|E1^^ < 16v/2(p+ l)wEog log (t/ 

/c = l V \ / \ / 

with probability at least 1 — 6/2. Define Cp = 48^2(p + 1) maxjlogD, logp“^}. When N > 
, Lemma 2 holds with probability at least 1 — ^/2. Taking the union bound, we have 


DxhifWfML) < Dxl(/||/o) + C'py ^ log ^^g 

holds with probability greater than 1 — 

To prove the second part, we construct one reference density / G Tx,p that gives the error spec¬ 
ified in the theorem. According to the argument provided in the paragraph prior to this proof, we 
assume the dimension that we cut at each iteration follows an order {1,2,''' ,p, 1,2,''-}. We then 
construct /o in the following way. At iteration i, we check the probability on the whole region. 

i If the probability is greater than 2p, we then cut at the midpoint of the selected dimension. 
If the resulting two blocks Bi and B 2 satisfy that Pf{Bi) > p and Pf{B 2 ) > p, we 
continue to the next iteration. However, if any of them fails to satisfy the condition, we find 
the minimum-deviated cut that satisfies the probability requirement. 
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ii If the probability on the whole region is less than 2p, we stop cutting on this region and 
move to the next region for the current iteration. 


It is easy to show that as long as p < 1/{2K), the above procedure is able to yield a if-block 
partition {Ak} before termination. Finally, the reference density / is defined as 


^ f. f{e)de 

fl^B) = V ' 1 i ( 6 >). 

1^1 A,y ) 


k=l 


(14) 


The construction procedure ensures the following property for / G Assuming if G [2^, 2^+^) 
for some d > 0 , then each must fall into either of the following two categories (could be both), 


1 . p < Pf{Ak) < 2 p, 

2 . Pf{Ak) > p and the longest edge of cube A^ must be less than 2 ~A/p\ < < 


We use ii and I 2 to denote the two different collections of sets. Now for any 60 > 0, let 5 = {/ < 
bo} U {f < bo}. We divide the KL-divergence between / and / into three regions and bound them 
accordingly. 


DKL{f\\f)= //log^ = / /l 0 g^+ / 
JQ I JB t JB 


f -f 

= Ml + M 2 + M 3 . 


, ./log^ + 


L 


/log- 
B=n(u/.) / 


We first look at Mi. Because / is a block-valued function, {/ < bo} must be the union of all the 
Aj^ that satisfies f{0)d0 < bo\Aj^\. Therefore, we have 


[ f{e)de= Y. /- B 0 )de 

df <bo 7 r I T I dAk 




< l>o|^/c| < bo- 


Therefore, we have 


[ f{e)de< [ f{0)de+ [ f{0)de = bo + boM = 2bo. 

JB j f<hn j f<bn 


IB Jf<bo Jf<bo 

Because / > mink P{Ak)/\Ak\ > P{Ak) > p, we have 


Mi= / flogl< [ /(0 )log^ <26o|log^|. 

Jb f Jb P P 


Next, we look at M 2 . It is clear that 


L 


and hence we have 


M 2 = / f\ogL<[ f{0)log^ < car d{Ii)2p\ log P <2Kp\ log A. 

JB^n[\Jii) f JB^n[\Jii) bo bo bo 


f{0)d0 < f{0)d0 < card{Ii)2p, 


/ 


Now for M 3 , we first use the inequality that log x < x — 1 for any x > 0, 

I , j(i ^ 


M,= f , ,/logi< 

JB 


iB^n 


(U^ 2 ) / JB-n{\Ji2) \f 


\f J JB'=n{\Ji2) f boJ\ji2 
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Using the mean value theorem for integration, we have f{0)/\Ak\ = f{0o) for some Oq G 

Ak. Also because / G C^(U), ||/'||oo is bounded, i.e., there exists some constant L such that 
|/(^i) - /(^ 2 )| < L Y!j=i l^ij “ |. Therefore, we have 

l/-/l= E /- l/W-/WI= E /- l/W-/(^o)|< E 4pLfc-^|ifc| <4pLA:-^ 

Akeh 

and thus 



^ ^ ApLD 1 

Ms < —;- K p . 

bo 


Putting all pieces together we have 

DxLifWf) < 26o|log —I + 2Kp\\og^\ + K~K 
P bo' bo 

Now taking bo = K~ 1 /( 2 p) p = K ^ , we have 

DKLifWf) < (logK)K-^ +2ilogD+'^^)K-^ +4pLDK-^ 

2p 

< (2 log K -\-2 log D + ApLD)K~ ^. 


Now defining Ci = 2 log D + ApLD and C 2 = 48a/p + 1 and combining with (13), we have 
DKLifWfML) < (Cl +2\ogK)K~^ +C2max{logL>,21ogK}y^^log log 


□ 


Appendix C: Proof of Theorem 2 

The KD-tree /kd always cuts at the empirical median for different dimensions, aiming to approxi¬ 
mate the true density by equal probability partitioning. For /kd we have the following result. 

Theorem 6. For any e > 0, define = log 2 ^ 2 -\- 3 LIs ^ ^ ^ 0, if N > 

32(log K^K log{2K/6), then with probability at least 1 — have 

\\fKD-fKD\\i<^FpLK p -f AelogK ^^ log * 

If the function is further lower bounded by some constant bo > 0, we can then obtain an upper 
bound on the KL-divergence. Define = log 2 2 + 3 L/bo ^ have 

OklUWIkd) < +8elogK^^^-\og(^^. 

When there are multiple functions and the median partition is performed on pooled data, the partition 
might not happen at the empirical median on each subset. However, as long as the partition quantiles 
are upper and lower bounded by a and 1 — a for some a G [1/2,1), we can establish similar theory 
as Theorem 2. 

Corollary 2. Assume we instead partition at different quantiles that are upper and lower bounded 
by a and 1 — a for some a G [1/2,1). Define = log 2 f ^ + 2 q+ 3 L/£+i 1 = ^^§2 f ^ 
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2jL76o+i )- Porany (5 > 0, if N > 
1 — 5 we have 


(i^j ^)2 i^(log ^og{K/5), then with probability at least 


WfKD - fKDh < e+pLK-^-^ + 

V — a 

and if the function is lower bounded by bo, then we have 


DKLifWfKD) < 

bo 1 — a 




Following the same argument for Theorem 1, we will prove Theorem 2 by assuming a predetermined 
order of partition (such as {1, 2, 3, • • • ,p, 1, 2, • • • }) for simplicity, though in the actual precedure, 
the dimensions are selected completely at random. We need the following two lemmas to prove 

Theorem 2. Let fxD have the same partition as fxD but with function value replaced by the true 

/a f{o)de 

probability on each region divided by the area, i.e., fKD = (^)- 

Lemma 3. With fxD defined above, for any 6 > 0, if N > 32e‘^ {log K)‘^K \og{K /6), then with 
probability at least 1 — have 

WIkd - IrdWi < 4elogFf^ — log 

and 

DKL{f\\fKD) < DKL{f\\fKD) + Sclog kJ log * 


If we instead partition at some different quantiles, which are upper bounded by a and lower bounded 
by 1 — a for some a G [1/2,1), then for any ^ > 0, if N > jYi^;) 2 KilogK)‘^\og{K/5), with 
probability at least 1 — 5 we have 


and 


WJkd - IrdWi < 


2elQg^ j^l-lo«oa-^ 
1 — a 



DxhifWfRD) < DxLifWfRD) H- - - - - 

1 — a 



Proof The proof is based on how close the data median is to the true median. Suppose there are 
Ni points in the current region, condition on this region, and partition it into two regions Ai and 
A 2 by cutting at the data median point Mi . Denote the true median by Mi and two anchor points 
Mi — ei. Mi + 62 such that P{X < Mi — ei) = 1/2 — t and P{X < Mi 62 ) = 1/2 + t for some 
0 < t < 1/2. By Chemoff’s inequality we have 

r PN- 1 

P(M,<M,-6i)<exp|-^| 

and 

r p]\f. 1 

P{Mi >MiP € 2 ) < exp I - 

The above two inequalities indicate that with high probability Mi is within the interval {Mi — 
61 , Mi + 62 ). Therefore, the probabilities on Ai and A 2 also satisfy that 

(15) 
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Now consider the K regions of /kd and d • Each partition will bring an error of at most 1 /2 + t 
to the estimation of the region probability. Therefore, assuming K G [2^ + 1, 2^+^) we have for 
each region (A^ is a random variable) that 


-t]< f{0)dO< 


if Uk/N = 1/2^, or 


/1 \ d-\-l p /1 \ 

if Uk/N = 1/2^+^. Notice that for all iterations before the current partition, we always have 
Ni > N/K. Therefore the probability is guaranteed to be greater than 1 — K exp | — ^ | • 


The above result indicates that 


max / fKD{0) - / fKD{0)\ < max < ( —\-t 


d-\-\ 1 \ 


d+l / 1 \ 




~ \2/ V ^ / ’ 

where i G (0, t). So if t < l/{2d), then (1 + 2t)^ < (1 + 1/dY < e and we have 

Ia <2((i+l)eQ^ 

This result implies that the total variation distance satisfies that 

WIkd - IrdWi = / \fKD{0) - fKD{0)\ = I / fKD{0)- f fKD{0) <4et\ogK. 


Similarly, one can also prove that 

X 4 fKoiO) 

max —- 

!aJkd{0) 


1 < max 


X4, fKDjO) - fKDjO) 
Ja, fKD{0) 


< 4et log K. 


Denote /(6>) = fKoiO) by P{Ak) and fKDf{0) by P{Ak). The KL-divergence can 
then be computed as 

PKLjfWfKp) - PKLjfWfKp) = y~l / f{d){logfKp{0)-logfKp{0)) 


= /Wflog/ /W-log/ fKpiO) 


'^P{Ak){ log 




< (1 + 4et log K)4:et log K < Set log K, 


as long as t < min 


1 1 

4elog K ’ 2 log2 K 


, with probability at least 1—K exp < — K . Consequently, 


for any ^ > 0, if > 32e^K{\og Ky ^og{K/5), then with probability at least 1 — we have 

joK f K\ 

WIkd - IrdWi < 4elog ATw log ( — ) 
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and 


2K 


K 


DklUWIkd) < DklUWIkd) + 8 elogi^y v"^/* 

When the partition occurs at a different quantile, which is assumed to be ai for iteration i, we have 


Uj-T -L p Uj-T -L 

Y[Wi-t)< j /(6»)d6'< P[(a'+t) 

i=l i=l 


where a • = if takes the region containing smaller data values and a • = 1 — if takes 
the other half. First, (15) can be updated as 


P{ I IE I ^ M < exp < — 


3a 


I < exp I 




(17) 


if t < 1 — a. Then we can bound the difference between f{0) and fKD{^) as 


max 

Ak 


Ak J V-y — JA;, 

d-\-\ dj-\-\ dj-\-\ d-\-\ 


n p ( wi-± a-|-± a-\-1. a-\-1. 

/ fKD{0)- fKoiO) < maxi P[(q;'+ i) - P[q;', - P[(a'- t) + P[ a' 

'J Ak I 


-d+l ^ d+l / , \ N a-|-± ^ a-|-± / i \ " 

[n<»i{n(-^)-}.n 4 -n 0 -a}] 


7=i 

d -\-1 


7=i 

(i -|-1 


< max 

d -\-1 


^ n 1)( 1+Y 


i=l 


a / 1 — a 


where i G (0, t). Thus if t < {1 — a)/d, then we have 


max 

Ak 


and 


[ fKD{S) - f fKoA) 

J Ak d Ak 

Iau fKD{0) 


^ e{dpl)ta^^^ ^ 2 etlogFf 


1 — a 


(1 - 


max 


Xt. fKD{0) 


- 1 


< 


The total variation distance follows 

WJkd - IrdWi < K • 
and the KL-divergence follows 


2et log K 


2et log K 
1 — a 


2etlogK ^ 


(1 - Q;)K>°g 2 < 


1 — a 


^l-log 2 a 


D^Mthn) - D^MHUn) < (l + 

\ 1 — ayl — a 1 — a 

iff < l/(2e(l — a) logFf). Consequently, for any ^ > 0, if Ar(log K)‘^ \og{K/5), then 

with probability at least 1 — ^, we have 

life - fell < 


and 


DKLifWfKD) < DKLifWfKD) + • 


□ 
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Our next result is to bound the distance between fxD and the true density /. Again, the proof 
depends on the control of the smallest value of / and the longest edge of every block. One issue 
now is that each partition might not happen at the midpoint, but it should not deviate from the 
midpoint too much given the bound on the /', i.e., we have the following proposition. 

Proposition 1. Assume we aim to partition an edge of length h (on dimension q) of a rectangular 
region A, which has a probability of P and an area of\A\. We distinguish the resulting two regions 
as the left and the right region and the corresponding edges (on dimension q) as hie/t and bright (b^-, 
hieft + bright = b). Suppose the partition ensures that the left region has probability of yP, where 
7 > 1 / 2 . IfWf'Woo < L, then the longer edge /i* = max{/i/e^, bright] satisfies that 

ft - l + Lh'f 


Proof It suffices to bound hi^a as 7 > 1 — 7 . Let g{f) = /(t, x), where t represents 

the variable of dimension q and x stands for the other dimensions. We then have = P, 

^dx=\A\/h^nA 

l5(^i)- 5 (^ 2 )! < [ {f{h,x) - fit 2 ,x)) < L\ti-t 2 \\A\/h. 

Therefore, using the mean value theorem for the integration, we know that 

<L\A\, 


which implies that 


U.h,,9it) 



b\Qft 

bright 


yh 

(1 - 7)ft. 


b\Qft 

bright 


< 


L\A\h 


Now if we solve the following inequality 

17 /a — (1 — 7 )/ 6 | < c and a + 6 = 1 , a > 0 , 6 > 0 , 
with some simple algebra we can get 

1-7 


a < 1 


Plug in the corresponding value, we have 

^left 


1 + c‘ 
1-7 


h l + Lh^y 


□ 


With Proposition 1, we can now obtain the upper bound for ||/ — ||i and OklUWIkd)- 

Lemma 4. For any 5 > 0, define = log 2 + 2 + 3 L/£ ^- ^ 72K \og{K/5) for any ^ > 0, 

then with probability at least 1 — have 

\\f-fKD\\i<ePpLK-^-f. 

If the f is further lower bounded by some bo > 0, the KL-divergence can be bounded as 


OkMIkd) < Pp^K- — 


where n, = logs + 2 + 3 X 75 ^)' 


ho 


Now suppose we instead partition at different quantiles, upper and lower bounded by a and 1 — a 
for some a G (1/2,1). For any 5 > (), if N > K \og(K/5) then the above two bounds hold 

with different and r^Q as 


= log 2 1 


(l-a) 


2 ol “h 3P/5 “h 1 


and = log 2 ( 1 + 


(l-a) 


2q; "h SL/bo "h 1 


19 



Proof. The proof for the total variation distance follows similarly as Theorem 1. For any 5 > 0, we 
consider B = {/kd < e/2}. We then partition the total variation distance formula into two parts 

WIkD - /111 = [ I/kD - /I + / \fKD - /I = Ml + M 2 . 

J B J 

It is straightforward to bound Mi. 5 is a union of which satisfies f{0) < e\Ak\/2. 

Therefore, 

M,< f f + [ fKD= 2 / f{0)<e. 

Ak:lJAk=B 

Now for M 2 , the usual analysis shows that our result depends on the longest edge of each block, i.e., 

M 2 = / Ukd -f\= [ \fKD-f\< P^KAkl =pL\B^\meochl < pLm^hl, 

Ak:UAk=B-Ak:UAk=B- ^ ^ 


where is the longest edge of each block contained in B^. Now using Proposition 1, we know for 
iteration i the partitioned edge at each block follows 


hi< - - —T7rVi-i < ^ Z V»-i- 

V l + Lh'-^J V l + L/eJ 

When K G (2^, 2^+^], each dimension receives [d/p\ stages of partitioning; therefore, we have for 
each block, the longest edge satisfies that 


/i* < 



1-7 A 

1 L/e J 


log2 ^ 

P 




where = log 2 ^ implies 

M 2 <pLK~^ and \\fKD — f\\i^e-\-pLK~^. 


Now, according to (15), we know with probability at least 1 — K exp{—(2Ff)}, 

Taking t = 1/6, we get 


’■■“‘‘’'“U + JtIiT?)’ 

with probability at least 1 — Kexp{—N/(12K)}. So if > 12K\og{K/5), then the probability 
is at least 1 — For the case when 7 = a + t, we choose t = (1 — a)/3, then 

(‘+ 777 ) 

with probability at least 1 — ^ if A^ > K \og{K/5). 

For KL-divergence, if / is lower bounded by some constant 60 > 0, then we know that 

okMUkd) = I m log <jjm - 1) 

< /i-lw In ~ - ^11-^ “ JkoWi- 

Because / and /kd are both lower bounded by bo, we can follow the proof for ||/ — /rdWi with 
e = bo and ignore Mi. Thus we have 


DRLifWfKD) < 


pLD 

——a: , 

bo 
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where = log2 ^ • Similarly, if we take 7 = 2/3 and N > 72K \og{K/S), then with 

probability at least 1 — we have 

If we take 7 = (2Qf + l)/3 and N > K \og{K/S), then with probability at least 1 — we 

have 

1 (1-^) ^ 

’'‘“=l°g'^l,l + 2» + 3L/to + l j^ 

□ 


Theorem 2 and Corollary 2 follow directly from Lemma 3 and 4. 

Proof of Theorem 2 and Corollary 2. For any e > 0, define and as in Lemma 4. Thus, for 
any ^ > 0, if > 32e^(log Ar)^Ariog then with probability 1 — 6/2 we have 

WIkd - IrdWi < 4:e\ogK ^log 

and 

DxLifWfKD) < DKLifWfKo) -^SelogK ^log 

Also, with probability 1 — ^/2 we have 

\\f-fKD\\i<e^pLK-^^, 

and 

DKL{f\\fKD)<^K-^-^. 

Oo 

Putting the two equations together we have 

\\fKD-fKD\\i<s + pLK-^-^+4e\ogK^^log(^^y 
and 

OklUWIkd) < +8elogK^^^-\og(^^. 

Using the same argument on random quantiles, if AT (log AT) ^ log(2Ar/^), then with 

probability at least 1 — ^ we have 

WfKD - fKD\\i <e + pLK-^-f + ^/Og'^ ^i-iog^"-' 

1 — a 

and 



OkMIkd) < + 

Oo 

where and are defined as 

+ 2a + 3Lt+l ) 


46 log K log^ Q, 1 
1 — a 



and 


log2 



2o!, + SL/ho + 1 / 


□ 
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Appendix E: Proof of Theorem 3 and 4 

Lemma 5. Assume ||/||oo < D. Under the same condition as Theorems 1 and 2, if 

Vn > 32cQ^^/2{p+l)Ki+^ ^log log 

then we have W/ml ||oo ^ 2Z) and if 

N > l28e^K{\ogKflog{K/5), 

we have ||/xd||oo < 2,0. 


Proof. Assume 


< D. We want to bound ||/ml||oo and ||/xd||oo- Define 


/=E 






which clearly satisfies f < D. Notice that if there exists some e such that 

max \P{Ak) - -P(Afe)| < e, 

where P{Ak) = E and P{Ak) = E then we have 


II/- /lloo = max 


P{Ak) P{Ak) 


\Ak\ \Ak\ 


^ e/(^) ^ 

< max , < max - 


eD 


A, P{Ak) A, p(^Ak)-e 

Now if we can pick e = min^^ P{Ak)/2, then the upper bound becomes 2D. We deduce the 
corresponding condition for ML-cut and KD-cut respectively. For maximum likelihood partition, 
plug in e = j2 into (12). Under the condition of Theorem 1, if 


AV > 32co ^^/2{p+l)K'^~^^'^> ^\og log 


sr 


then with probability at least 1 — 6/2, we have 

||/ml||oo < ‘2D. 

For median partition, choose e = K~^/2 and apply (16). Under the condition of Theorem 2, if 

N > 12Se^K{logK)^log{K/6), 
then with probability at least 1 — we have 

WIrdWoo < 2D. 


□ 


Proof of Theorem 3. Assume the average total variation distance between and is 5. It can 
be calculated directly as 


llRHe)-Y[R\ 0 ) 


iei 


iei 


d 0 <f 2 [ n n 

i=l j=l l=i+l 

m « 

< (2D)—1 ^ J \R\e) - R\e)\de 


< m( 2 D)'"-ie. 


= 1 ' 
m —1 ^ 
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Letting Z[ = J Hie/ we have 


\Zi-Zi\ = 
Thus 

II// - //111 = 


dO 


i=l 


i=l 


< 


J 


n/«(0)-n/^'n^) 


i=l 


i=l 


dO < m{2D) 


m—1 


^ m ^ m 

1 ri/W(0)-1 ri/w(0) 

.-I Zi 


i=l 


dx = 


/ 




-I 


ZiZi 

Zi Uiei - Zi Uiei H^e/ Uiei 


de 


< — 
- ^/ 


hi 


n/w-!!/(' 

iei iei 

m—l^ 


ZiZi 


dO^—\Zi-Zi\ 


dO 


< —m{2D)^-^e. 
Zi 


□ 


Proof of Theorem 4. Assuming m G [2^, 2^+^), then after 5 + 1 iterations, we will obtain our final 
aggregated density. At iteration /, each true density is some aggregation of the original m densities, 
which can be represented by ///, where /' is the set of indices of the original densities. Let 
be the total variation distance between the true density and the approximation for the pair (A, ^ 2 ) 
caused by combining. For example, when / = 1, Ji, /2 contain only a single element, i.e., Ii = {ii} 
and I 2 = {^2}- Recall that Co = max///(^7/c/ ZiifZji^jn /Zji, using the result from Theorem 3, we 
have 


Alld2) _ 

j(L) J(i2) 

j(n) j(i2) 

— 

//(n)/fe) 

//(n)/(L) 


_ = 

//(n)/(.+ 


2 r f(L) r fii 2 ) 


2Ds < ACoDe. 


We prove the result by induction. Assuming we are currently at iteration / + 1, and the paired two 
densities are //^ and fi^ where Ii^h ^ T By induction, the approximation obtained at iteration I 
are //^ and fj^ which satisfies that 


II //1 - 4 111 < {iCoDYs, Wfi, - +II 1 < {iCoDYs. 


Using Theorem 3 again, we have that 


,(L,/2) _ 
‘^l + l ~ 


< 


fh fl2 _ fh fl2 

J fhfh J fllfl2 
ln.eiJ^VYl.ei2 

f niG/iu/2 


^ 2 f II/l ~ /filli + II //2 ~ //2II1 

“ J fhfh I 2 

f(i) 

- — {AD) ■ {ACqDYs < (4CoD)'+^e 


} 


Consequently, the final approximation satisfies that 


II// - //111 < {ACoD)^+h < (4CoZ))'°8^'"+'£. 


□ 


Appendix D: Supplement to Two Toy Examples 

Bimodal Example Figure 7 compares the aggregated density of PART-KD/PART-ML for several 
alternative combination schemes to the true density. This complements the results from one-stage 
combination with uniform block-wise distribution presented in Figure 1 of the main text. 
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Figure 7: Bimodal posterior combined from 10 subsets. The results from PART-KD/PART-ML mul¬ 
tiscale histograms are shown for (1) one-stage combination with local Gaussian smoothing (2) pair¬ 
wise combination with local Gaussian smoothing. 


Rare Bernoulli Example The left panel of Figure 8 shows additional results of posteriors aggre¬ 
gated from PART-KD/PART-ML random tree ensemble with several alternative combination strate¬ 
gies, which complement the results presented in Figure 2 of the main text. All of the produced 
posteriors correctly locate the posterior mass despite the heterogeneity of subset posteriors. The 
fake “ripples” produced by pairwise ML aggregation are caused by local Gaussian smoothing. 

Also, the right panel of Figure 8 shows that the posteriors produced by nonparametric and semipara- 
metric methods miss the right scale. 
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Figure 8: Posterior of the probability ^ of a rare event combined from M = 15 subsets of in¬ 
dependent Bernoulli trials. Left: the results from KD/ML multiscale histograms are shown for (1) 
one-stage combination with local Gaussian smoothing (2) pairwise combination with local Gaussian 
smoothing. Right: posterior aggregated from nonparametric and semiparametric methods. 


Appendix F: Supplement to Bayesian Logistic Regression 

Figure 9 additionally plots the prediction accuracy against the length of subset chains supplied to 
the aggregation algorithms, for Bayesian logistic regression on two real datasets. For simplicity, 
the same number of posterior samples from all subset chains are aggregated, with the first 20% 
discarded as bum-in. As a reference, we also show the result for running the full chain. As can 
be seen from Figure 9, the performance of PART-KD/ML agrees with that of the full chain as the 
number of posterior samples increase, validating the theoretical results presented in Theorem 1 and 
Theorem 2 in the main text. 
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Logistic regression on Covertype dataset 



Logistic regression on MiniBooNE dataset 
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Figure 9: Prediction accuracy versus the length of subset chains on the covertype and the MiniBooNE 
dataset. 
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